Goto

Collaborating Authors

 medical education


MedTutor-R1: Socratic Personalized Medical Teaching with Multi-Agent Simulation

He, Zhitao, Yang, Haolin, Qin, Zeyu, Fung, Yi R

arXiv.org Artificial Intelligence

The significant gap between rising demands for clinical training and the scarcity of expert instruction poses a major challenge to medical education. With powerful capabilities in personalized guidance, Large Language Models (LLMs) offer a promising solution to bridge this gap. However, current research focuses mainly on one-on-one knowledge instruction, overlooking collaborative reasoning, a key skill for students developed in teamwork like ward rounds. To this end, we develop ClinEdu, a multi-agent pedagogical simulator with personality-driven patients and diverse student cohorts, enabling controlled testing of complex pedagogical processes and scalable generation of teaching data. Based on ClinEdu, we construct ClinTeach, a large Socratic teaching dialogue dataset that captures the complexities of group instruction. We then train MedTutor-R1, the first multimodal Socratic tutor designed for one-to-many instruction in clinical medical education. MedTutor-R1 is first instruction-tuned on our ClinTeach dataset and then optimized with reinforcement learning, using rewards derived from a three-axis rubric, covering structural fidelity, analytical quality, and clinical safety, to refine its adaptive Socratic strategies. For authentic in-situ assessment, we use simulation-based interactive evaluation that redeploys the tutor back into ClinEdu. Experimental results demonstrate that our MedTutor-R1 outperforms the base model by over 20% in average pedagogical score and is comparable to o3, while also exhibiting high adaptability in handling a varying number of students. This promising performance underscores the effectiveness of our pedagogical simulator, ClinEdu.


Can Large Language Models Function as Qualified Pediatricians? A Systematic Evaluation in Real-World Clinical Contexts

Zhu, Siyu, Bian, Mouxiao, Xie, Yue, Tang, Yongyu, Yu, Zhikang, Li, Tianbin, Chen, Pengcheng, Han, Bing, Xu, Jie, Dong, Xiaoyan

arXiv.org Artificial Intelligence

With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework centered on a knowledge-system framework and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and medical ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined ~15% as task complexity increased, revealing limitations in complex reasoning. Multiple-choice assessments highlighted weaknesses in integrative reasoning and knowledge recall. In dynamic diagnosis and treatment scenarios, DeepSeek-R1 scored highest in case reasoning (mean 0.58), yet most models struggled to adapt to real-time patient changes. On pediatric medical ethics and safety tasks, Qwen2.5-72B performed best (accuracy 92.05%), though humanistic sensitivity remained limited. These findings indicate that pediatric LLMs are constrained by limited dynamic decision-making and underdeveloped humanistic care. Future development should focus on multimodal integration and a clinical feedback-model iteration loop to enhance safety, interpretability, and human-AI collaboration. While current LLMs cannot independently perform pediatric care, they hold promise for decision support, medical education, and patient communication, laying the groundwork for a safe, trustworthy, and collaborative intelligent pediatric healthcare system.


Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D

Gupta, Arsh, Sridhar, Ajay Narayanan, Mingole, Bonam, Yadav, Amulya

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated capabilities across diverse domains, yet their performance on rare disease diagnosis from narrative medical cases remains underexplored. We introduce a novel dataset of 176 symptom-diagnosis pairs extracted from House M.D., a medical television series validated for teaching rare disease recognition in medical education. We evaluate four state-of-the-art LLMs such as GPT 4o mini, GPT 5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro on narrative-based diagnostic reasoning tasks. Results show significant variation in performance, ranging from 16.48% to 38.64% accuracy, with newer model generations demonstrating a 2.3 times improvement. While all models face substantial challenges with rare disease diagnosis, the observed improvement across architectures suggests promising directions for future development. Our educationally validated benchmark establishes baseline performance metrics for narrative medical reasoning and provides a publicly accessible evaluation framework for advancing AI-assisted diagnosis research.


CLiVR: Conversational Learning System in Virtual Reality with AI-Powered Patients

Amithasagaran, Akilan, Dakshit, Sagnik, Suryadevara, Bhavani, Stockton, Lindsey

arXiv.org Artificial Intelligence

Simulations constitute a fundamental component of medical and nursing education and traditionally employ standardized patients (SP) and high-fidelity manikins to develop clinical reasoning and communication skills. However, these methods require substantial resources, limiting accessibility and scalability. In this study, we introduce CLiVR, a Conversational Learning system in Virtual Reality that integrates large language models (LLMs), speech processing, and 3D avatars to simulate realistic doctor-patient interactions. Developed in Unity and deployed on the Meta Quest 3 platform, CLiVR enables trainees to engage in natural dialogue with virtual patients. Each simulation is dynamically generated from a syndrome-symptom database and enhanced with sentiment analysis to provide feedback on communication tone. Through an expert user study involving medical school faculty (n=13), we assessed usability, realism, and perceived educational impact. Results demonstrated strong user acceptance, high confidence in educational potential, and valuable feedback for improvement. CLiVR offers a scalable, immersive supplement to SP-based training.


The Integration of Artificial Intelligence in Undergraduate Medical Education in Spain: Descriptive Analysis and International Perspectives

Janeiro, Ana Enériz, Pereira, Karina Pitombeira, Mayol, Julio, Crespo, Javier, Carballo, Fernando, Cabello, Juan B., Ramos-Casals, Manel, Corbacho, Bibiana Pérez, Turnes, Juan

arXiv.org Artificial Intelligence

AI is transforming medical practice and redefining the competencies that future healthcare professionals need to master. Despite international recommendations, the integration of AI into Medicine curricula in Spain had not been systematically evaluated until now. A cross-sectional study (July-September 2025) including Spanish universities offering the official degree in Medicine, according to the 'Register of Universities, Centers and Degrees (Registro de Universidades, Centros y Títulos RUCT)'. Curricula and publicly available institutional documentation were reviewed to identify courses and competencies related to AI in the 2025-2026 academic year. The analysis was performed using descriptive statistics. Of the 52 universities analyzed, ten (19.2%) offer specific AI courses, whereas 36 (69.2%) include no related content. Most of the identified courses are elective, with a credit load ranging from three to six ECTS, representing on average 1.17% of the total 360 credits of the degree. The University of Jaén is the only institution offering a compulsory course with AI content. The territorial analysis reveals marked disparities: Andalusia leads with 55.5% of its universities incorporating AI training, while several communities lack any initiative in this area. The integration of AI into the medical degree in Spain is incipient, fragmented, and uneven, with a low weight in ECTS. The limited training load and predominance of elective courses restrict the preparation of future physicians to practice in a healthcare environment increasingly mediated by AI. The findings support the establishment of minimum standards and national monitoring of indicators.


A Human-Centered Approach to Identifying Promises, Risks, & Challenges of Text-to-Image Generative AI in Radiology

Morrison, Katelyn, Mathur, Arpit, Bradshaw, Aidan, Wartmann, Tom, Lundi, Steven, Zandifar, Afrooz, Dai, Weichang, Batmanghelich, Kayhan, Eslami, Motahhare, Perer, Adam

arXiv.org Artificial Intelligence

As text-to-image generative models rapidly improve, AI researchers are making significant advances in developing domain-specific models capable of generating complex medical imagery from text prompts. Despite this, these technical advancements have overlooked whether and how medical professionals would benefit from and use text-to-image generative AI (GenAI) in practice. By developing domain-specific GenAI without involving stakeholders, we risk the potential of building models that are either not useful or even more harmful than helpful. In this paper, we adopt a human-centered approach to responsible model development by involving stakeholders in evaluating and reflecting on the promises, risks, and challenges of a novel text-to-CT Scan GenAI model. Through exploratory model prompting activities, we uncover the perspectives of medical students, radiology trainees, and radiologists on the role that text-to-CT Scan GenAI can play across medical education, training, and practice. This human-centered approach additionally enabled us to surface technical challenges and domain-specific risks of generating synthetic medical images. We conclude by reflecting on the implications of medical text-to-image GenAI.


AI-Powered Detection of Inappropriate Language in Medical School Curricula

Salavati, Chiman, Song, Shannon, Hale, Scott A., Montenegro, Roberto E., Dori-Hacohen, Shiri, Murai, Fabricio

arXiv.org Artificial Intelligence

The use of inappropriate language--such as outdated, exclu-sionary, or non-patient-centered terms--in medical instructional materials can significantly influence clinical training, patient interactions, and health outcomes. Despite their reputability, many materials developed over past decades contain examples now considered inappropriate by current medical standards. Given the volume of curricular content, manually identifying instances of inappropriate use of language (IUL) and its subcategories for systematic review is prohibitively costly and impractical. To address this challenge, we conduct a first-in-class evaluation of small language models (SLMs) fine-tuned on labeled data and pre-trained LLMs with in-context learning on a dataset containing approximately 500 documents and over 12,000 pages. For SLMs, we consider: (1) a general IUL classifier, (2) subcategory-specific binary classifiers, (3) a multilabel classifier, and (4) a two-stage hierarchical pipeline for general IUL detection followed by mul-tilabel classification. For LLMs, we consider variations of prompts that include subcategory definitions and/or shots. We found that both LLama-3 8B and 70B, even with carefully curated shots, are largely outperformed by SLMs. While the multilabel classifier performs best on annotated data, supplementing training with unflagged excerpts as negative examples boosts the specific classifiers' AUC by up to 25%, making them most effective models for mitigating harmful language in medical curricula.


LLM-Powered Virtual Patient Agents for Interactive Clinical Skills Training with Automated Feedback

Voigt, Henrik, Sugamiya, Yurina, Lawonn, Kai, Zarrieß, Sina, Takanishi, Atsuo

arXiv.org Artificial Intelligence

Objective Structured Clinical Examinations (OSCEs) are essential for medical training, but they require significant resources, including professional actors and expert medical feedback. Although Large Language Models (LLMs) have introduced text-based virtual patients for communication practice, these simulations often lack the capability for richer, non-textual interactions. This paper presents a novel framework that significantly enhances LLM-based simulated patients by equipping them with action spaces, thereby enabling more realistic and dynamic patient behaviors that extend beyond text. Furthermore, our system incorporates virtual tutors that provide students with instant, personalized feedback on their performance at any time during these simulated encounters. We have conducted a rigorous evaluation of the framework's real-time performance, including system latency and component accuracy. Preliminary evaluations with medical experts assessed the naturalness and coherence of the simulated patients, as well as the usefulness and appropriateness of the virtual tutor's assessments. This innovative system provides medical students with a low-cost, accessible platform for personalized OSCE preparation at home.


Hallucination vs interpretation: rethinking accuracy and precision in AI-assisted data extraction for knowledge synthesis

Long, Xi, Boscardin, Christy, Maggio, Lauren A., Costello, Joseph A., Gonzales, Ralph, Hammoudeh, Rasmyah, Lai, Ki, Park, Yoon Soo, Gin, Brian C.

arXiv.org Artificial Intelligence

Knowledge syntheses (literature reviews) are essential to health professions education (HPE), consolidating findings to advance theory and practice. However, they are labor-intensive, especially during data extraction. Artificial Intelligence (AI)-assisted extraction promises efficiency but raises concerns about accuracy, making it critical to distinguish AI 'hallucinations' (fabricated content) from legitimate interpretive differences. We developed an extraction platform using large language models (LLMs) to automate data extraction and compared AI to human responses across 187 publications and 17 extraction questions from a published scoping review. AI-human, human-human, and AI-AI consistencies were measured using interrater reliability (categorical) and thematic similarity ratings (open-ended). Errors were identified by comparing extracted responses to source publications. AI was highly consistent with humans for concrete, explicitly stated questions (e.g., title, aims) and lower for questions requiring subjective interpretation or absent in text (e.g., Kirkpatrick's outcomes, study rationale). Human-human consistency was not higher than AI-human and showed the same question-dependent variability. Discordant AI-human responses (769/3179 = 24.2%) were mostly due to interpretive differences (18.3%); AI inaccuracies were rare (1.51%), while humans were nearly three times more likely to state inaccuracies (4.37%). Findings suggest AI variability depends more on interpretability than hallucination. Repeating AI extraction can identify interpretive complexity or ambiguity, refining processes before human review. AI can be a transparent, trustworthy partner in knowledge synthesis, though caution is needed to preserve critical human insights.


A Fuzzy Supervisor Agent Design for Clinical Reasoning Assistance in a Multi-Agent Educational Clinical Scenario Simulation

Zheng, Weibing, Turner, Laurah, Kropczynski, Jess, Ozer, Murat, Overla, Seth, Halse, Shane

arXiv.org Artificial Intelligence

Assisting medical students with clinical reasoning (CR) during clinical scenario training remains a persistent challenge in medical education. This paper presents the design and architecture of the Fuzzy Supervisor Agent (FSA), a novel component for the Multi-Agent Educational Clinical Scenario Simulation (MAECSS) platform. The FSA leverages a Fuzzy Inference System (FIS) to continuously interpret student interactions with specialized clinical agents (e.g., patient, physical exam, diagnostic, intervention) using pre-defined fuzzy rule bases for professionalism, medical relevance, ethical behavior, and contextual distraction. By analyzing student decision-making processes in real-time, the FSA is designed to deliver adaptive, context-aware feedback and provides assistance precisely when students encounter difficulties. This work focuses on the technical framework and rationale of the FSA, highlighting its potential to provide scalable, flexible, and human-like supervision in simulation-based medical education. Future work will include empirical evaluation and integration into broader educational settings. More detailed design and implementation is open sourced here.